D490-D495 Nucleic Acids Research, 2014, Vol. 42, Database issue 
doi:10.1093/nar/gkt1178



Published online 21 November 2013 



The carbohydrate-active enzymes database 
(CAZy) in 2013 

Vincent Lombard, Hemalatha Golaconda Ramulu, Elodie Drula,
Pedro M. Coutinho and Bernard Henrissat*

Centre National de la Recherche Scientifique, CNRS UMR 7257, 13288 Marseille, France and Aix-Marseille
Université, AFMB, 163 Avenue de Luminy, 13288 Marseille, France

Received September 20, 2013; Revised October 30, 2013; Accepted October 31, 2013 



ABSTRACT 

The Carbohydrate-Active Enzymes database (CAZy; 
http://www.cazy.org) provides online and continu- 
ously updated access to a sequence-based family 
classification linking the sequence to the specificity 
and 3D structure of the enzymes that assemble, 
modify and break down oligo- and polysaccharides.
Functional and 3D structural information is added 
and curated on a regular basis based on the avail- 
able literature. In addition to the use of the database 
by enzymologists seeking curated information on 
CAZymes, the dissemination of a stable nomencla- 
ture for these enzymes is probably a major contri- 
bution of CAZy. The past few years have seen the 
expansion of the CAZy classification scheme to new 
families, the development of subfamilies in several 
families and the power of CAZy for the analysis of 
genomes and metagenomes. This article outlines 
the changes that have occurred in CAZy during the 
past 5 years and presents our novel effort to display 
the resolution and the carbohydrate ligands in crys- 
tallographic complexes of CAZymes. 

INTRODUCTION 

Despite their similar chemical composition, carbohydrates 
can form an enormous number of combinations through 
the stereochemical variety of the hydroxyl groups that 
they carry, through the many possibilities to assemble 
monosaccharides one to another, and through the 
wealth of noncarbohydrate substituents that can 
decorate the resulting oligo- and polysaccharides.
Complex carbohydrates are widely distributed in nature, 
where they mediate a multitude of biological functions, 
from carbon reserves and structural molecules to the
mediators of intra- and intercellular recognition within
one organism or between organisms. The diversity of com- 
plex carbohydrates is controlled by a panel of enzymes 
involved in their assembly (glycosyltransferases) and 
their breakdown (glycoside hydrolases, polysaccharide 
lyases, carbohydrate esterases), collectively designated as 
Carbohydrate-Active enZymes (CAZymes). CAZymes
have been classified in sequence-based families for >22
years (1-6) and this classification has become the 
standard of the field (7). 

The first defining feature of CAZyme classification is 
that the families are defined based on significant amino 
acid sequence similarity with at least one biochemically 
characterized founding member (1). The consequence is 
that sequences that display too little similarity to ensure 
a significant alignment are not included, nor used to form
putative families, as distant relatives of CAZymes may 
have other functions. Borderline cases are stored in the 
nonclassified section of each CAZyme category, awaiting 
biochemical characterization. A second defining feature is 
that our classification is made module by module. 
CAZymes are frequently modular proteins with a catalytic 
module harbouring a variable number of other discrete 
modules, which can be either catalytic or not. Thus a 
modular CAZyme can be assigned to several families if 
its constitutive modules belong to separate families. The 
third important feature is that we systematically analyse
only protein sequences released in the daily releases of
GenBank (ftp://ftp.ncbi.nih.gov/genbank/daily-nc), to 
avoid analysing unfinished sequences that may change 
accession number. 
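
The module-by-module principle can be made concrete with a small data-structure sketch. The Python below is illustrative only and is not taken from the CAZy codebase: the class names, the accession and the residue boundaries are hypothetical, and only the family labels follow the CAZy naming style.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """One discrete module of a protein (hypothetical representation)."""
    family: str   # e.g. "GH5" or "CBM2"; empty string if not classified
    start: int    # first residue of the module (1-based)
    end: int      # last residue of the module


@dataclass
class Protein:
    accession: str
    modules: list[Module]

    def families(self) -> list[str]:
        """Distinct families of the modules, from N- to C-terminus."""
        seen: list[str] = []
        for module in sorted(self.modules, key=lambda m: m.start):
            if module.family and module.family not in seen:
                seen.append(module.family)
        return seen


# Hypothetical two-module enzyme: a binding module followed by a catalytic one.
example = Protein(
    accession="HYPOTHETICAL_0001",
    modules=[Module("GH5", 140, 450), Module("CBM2", 1, 100)],
)
print(example.families())
```

In this representation a single protein entry is reported under every family to which one of its modules belongs, which is the behaviour described above for modular CAZymes.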

As early as 1991, it was noted that the sequence-based 
families of glycoside hydrolases grouped together enzymes
of different substrate specificities (i.e. enzymes with 'dif- 
ferent' EC numbers) (1) demonstrating that the acquisi- 
tion of novel specificity has been commonplace during 
evolution. This feature was subsequently noted for the
other classes of CAZymes (4,6). The processes by which a
novel substrate specificity was acquired from a common 
ancestor leave detectable traces in the sequence of contem- 
porary proteins. Thus, unexpectedly, the usual drawback 
of carbohydrates (their chemical resemblance) is at the 
origin of their success in the postgenomic era: CAZymes 
need to be specific to perform their biological functions. 
While the precise specificity of DNAses, RNAses, prote- 
ases and esterases is difficult or impossible to derive from 
their sequence alone, the CAZyme classification system 
allows in some cases the prediction of the broad 
category of carbohydrate substrate, based on the assign- 
ment to a family (8). This carries the potential to infer the 
glycobiological profile of an organism (or a community 
thereof) based on DNA sequence. However, the occur- 
rence of enzymes that act on different substrates in the 
same family is a significant problem for the automated 
functional annotation of CAZyme-related genes. This 
can sometimes be overcome by the definition of 
subfamilies within families (9,10) (see below), but our
current knowledge of the sequence-to-specificity relationships
in CAZyme families is still too limited and too
unevenly distributed across families to allow unsupervised
automated substrate prediction.

The Carbohydrate-Active Enzymes database (CAZy; 
http://www.cazy.org) was launched in 1999 to provide 
online and constantly updated access to the family classification
of CAZymes. Coupled to the CAZypedia encyclopaedic
resource (http://www.cazypedia.org), CAZy is the
only comprehensive resource that correlates the sequence, 
structure and molecular mechanism of CAZymes. CAZy 
was presented in this journal in 2009 (11) and the present
article outlines the changes that have been implemented in 
CAZy during the past 5 years. 



WEBSITE DESIGN 

In March 2011, the website interface was thoroughly redesigned
both in appearance (new layout, new colours and new
logo) and in content. New sections have been added, as well
as new links to commercial providers that list their
products following the CAZy nomenclature. Other additions
cover scientific meetings relevant to CAZymes, positions
available and a 'what's new' section that provides
news on changes in the CAZy database. More interactivity
in the display of information associated with each family
was introduced (Figure 1). In particular, each family now
has a specific tab, which lists those individual CAZymes
that we believe have been experimentally characterized.
Because the number of entries in several families had
become impractically large, the display was modified to
show only the header for each family along with a series of
tabs for access to subsets (All, Archaea, Bacteria,
Eukaryota, unclassified, Structure, Characterized). Each
tab displays 1000 entries per page, except for the tab
listing the characterized enzymes, where only 100 entries
are shown per page. The search tool was also revised and
one can now search the entire site or specific fields such as
CAZy family, taxonomic identifier, organism name,
protein name, accessions in different databases
(GenBank, Uniprot and Protein Data Bank (PDB)),
known activities, EC number, mechanisms or clan.

NOVEL ENZYME CLASS 

Because lignin is invariably found together with polysaccharides
in the plant cell wall and because lignin fragments
are likely to act in concert with lytic polysaccharide
mono-oxygenases (LPMOs), families of lignin degradation
enzymes and of LPMOs have been used to define a new
CAZy class that we have named 'Auxiliary Activities' to
accommodate a broad range of enzyme mechanisms and
substrates related to lignocellulose conversion (12).

DATABASE GROWTH 

At the date of submission of this article, CAZy reports 
sequence information on almost 340 000 CAZymes, a
staggering 225% increase compared with 5 years ago 
(Table 1). During the same period, the number of bio- 
chemically characterized CAZymes has grown by only 
30% to 12 700 and the number of CAZymes with 3D 
structures has grown by ~78% (Table 1). Despite this 
growth, only ~1400 (0.4%) of the 340 000 CAZymes
have a 3D structure solved to date. The past 5 years 
have seen the number of families covered by CAZy grow 
slowly to >330 at present. Five years ago the number of 
genome sequences analysed in CAZy was 750 (11). This 
number is now greater than 2800 (see below), representing 
a 3.8-fold increase. The continuously growing gap between 
the number of sequences and the number of biochemically 
or structurally characterized CAZymes is a direct conse- 
quence of the avalanche of genome sequences resulting 
from modern sequencing technologies combined with the 
much lower pace of experimental characterization of gene 
products. This gap would be even more considerable if one
were to search for and list CAZymes in nonfinished genomes.
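
The growth figures quoted in this section can be rechecked from the totals of Table 1 and from the genome counts given in the genomes section. The following Python is a minimal arithmetic sketch; the variable and function names are ours and purely illustrative.

```python
# Totals from Table 1 as (Dec-2008, Sept-2013).
TOTALS = {
    "sequences": (103_111, 337_143),
    "characterized": (9_695, 12_730),
    "with_structure": (801, 1_419),
}

# Genome counts quoted in the genomes section:
# Bacteria + Archaea + Eukaryota + Viruses, versus 750 five years earlier.
GENOMES_2013 = 2351 + 158 + 73 + 240
GENOMES_2008 = 750


def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


growth = {name: percent_increase(old, new) for name, (old, new) in TOTALS.items()}
structure_share = 100.0 * TOTALS["with_structure"][1] / TOTALS["sequences"][1]
genome_fold = GENOMES_2013 / GENOMES_2008

for name, value in growth.items():
    print(f"{name}: +{value:.0f}%")
print(f"share of sequences with a 3D structure: {structure_share:.2f}%")
print(f"genomes analysed: {genome_fold:.1f}-fold increase")
```

Computed this way, the increases are close to the rounded values given in the text (about 227% for sequences, 31% for characterized enzymes and 77% for structures), the share of sequences with a 3D structure is about 0.4%, and the genome count has grown about 3.8-fold.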

DATABASE CONTENT: SUBFAMILIES 

The occurrence of enzymes that act on different substrates 
in the same family prevents the straightforward functional 
annotation of CAZyme-related genes. The division of 
CAZyme families into subfamilies based on phylogenetic 
analysis has been explored as a possible approach to 
improve the relationship between sequence and specificity. 
Subfamily classification of GH5, GH13, GH30 and all of 
the PL families has shown that the majority of the defined 
subfamilies are monospecific, thus indicating that the correlation
of substrate specificity with sequences is significantly
better at the subfamily level than the family level
(9,10,13). An additional benefit of the division 
into subfamilies is the identification of currently 
uncharacterized subfamilies that can subsequently be 
analysed experimentally to unveil potential new activities. 
Subfamilies are currently displayed for families GH5, 
GH13, GH30, AA1-AA5 and for all PL families. Many 
more families are currently being evaluated for subfamily definitions.
Care is taken that the subfamilies are defined in a
robust manner to avoid confusion that would arise from
constant redefinitions and resulting different naming conventions.
We prefer to let the subfamilies 'mature' until we
feel that the subfamily quality and stability are sufficient for
public release.

Table 1. Growth of the CAZy database during the past 5 years

Protein    Sequences               Characterized           With structure
class      Sept-2013   Dec-2008    Sept-2013   Dec-2008    Sept-2013   Dec-2008
GH           159 274     46 654         9221       6805          817        475
GT           119 910     40 863         1936       1846          139         83
PL              4043       1301          336        262           51         34
CE            15 856       5083          275        212           74         43
CBM           32 259       9210          663        570          280        166
AA              5801       464a          299        71a           58         3a
Total        337 143    103 111       12 730       9695         1419        801

a Numbers estimated from the literature: the AA category did not exist in December 2008.
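
The notion of a 'monospecific' subfamily used in this section can be illustrated with a short calculation. The Python below is a sketch on invented toy data: the group names and activity labels are placeholders and do not correspond to any real CAZy family.

```python
def monospecific_fraction(activities_by_group: dict[str, set[str]]) -> float:
    """Share of characterized groups whose members show a single activity."""
    characterized = [acts for acts in activities_by_group.values() if acts]
    if not characterized:
        return 0.0
    mono = sum(1 for acts in characterized if len(acts) == 1)
    return mono / len(characterized)


def uncharacterized(activities_by_group: dict[str, set[str]]) -> list[str]:
    """Groups with no characterized member: candidates for experiments."""
    return sorted(g for g, acts in activities_by_group.items() if not acts)


# Invented example: one polyspecific family, then the same family split up.
family_level = {"FAM": {"activity_A", "activity_B", "activity_C"}}
subfamily_level = {
    "FAM_1": {"activity_A"},
    "FAM_2": {"activity_B"},
    "FAM_3": {"activity_B", "activity_C"},
    "FAM_4": set(),  # no characterized member yet
}

print(monospecific_fraction(family_level))
print(monospecific_fraction(subfamily_level))
print(uncharacterized(subfamily_level))
```

In this toy case the family as a whole is polyspecific, whereas two of the three characterized subfamilies are monospecific, and the empty subfamily is flagged as a target for experimental characterization.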



DATABASE CONTENT: GENOMES 

The collection of carbohydrate-active enzymes encoded by 
the genome of an organism ('CAZome') provides an 
insight into the nature and extent of the metabolism of 
complex carbohydrates of the species. The CAZomes of 
free-living organisms typically correspond to 1-5% of 
the predicted coding sequences. Extremely reduced 
CAZomes are characteristic of species with a strict intra- 
cellular parasitic lifestyle. Because of the massive 
chemical, structural and functional variability of carbohydrates,
CAZome comparisons can highlight the adaptation
of the CAZyme repertoire of species to their
environment (14,15). 

Since 2011, in addition to giving the family distribution,
the new CAZy website displays the complete list of
putative CAZymes (with accession numbers) of each
genome that was analysed. At present, CAZy covers
>2800 genomes in the following kingdoms: Bacteria
(2351), Archaea (158), Eukaryota (73), Viruses (240). The
CAZomes listed in the CAZy website correspond to
protein models of finished genomes, i.e. with proteins
released in the daily releases of GenBank
(ftp://ftp.ncbi.nih.gov/genbank/daily-nc). In a few cases, genomes with
protein models not released as finished entries in GenBank
but publicly available have been analysed and are presented
in CAZy. However, for these few cases, the display
only shows the number of proteins in each family, but
does not feature the actual list of proteins.

Genomes are analysed using the CAZy pipeline, which
compares protein models with sequence libraries using
BLAST and with profile libraries using HMM tools. Both
libraries are created from the sequences of the catalytic
and noncatalytic modules of the CAZy database. This is
followed by a manual inspection by expert curators to
resolve borderline cases (11). Our methodology provides
coherent, expert and comparable sets of annotations. In
this respect, one should note that the correspondence
between CAZy families and those in PFAM (16)/INTERPRO (17) or
DBCAN (18) is far from perfect. This is due to a variety
of reasons that include different strategies, different
thresholds, different goals, different methods, different
training sets and different degrees of expert curation. An
unfortunate consequence is that the CAZyme analysis of a
genome performed with one method usually cannot be
compared with that done with another.
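
The general shape of such a two-method pipeline, with automatic assignment of clear cases and referral of borderline cases to a curator, can be sketched as follows. This Python is not the CAZy pipeline: the record layout, the function name and both E-value cut-offs are placeholders chosen for illustration (the text does not give the actual thresholds), and the sketch assigns one family per protein rather than one per module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    protein: str
    family: str
    evalue: float
    method: str  # "blast" or "hmm"


# Placeholder cut-offs, NOT the values used by CAZy.
CONFIDENT = 1e-20
BORDERLINE = 1e-5


def triage(hits: list[Hit]) -> tuple[dict[str, str], set[str]]:
    """Split proteins into auto-assigned families and cases for a curator."""
    best: dict[tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.protein, hit.method)
        if key not in best or hit.evalue < best[key].evalue:
            best[key] = hit

    assigned: dict[str, str] = {}
    to_curate: set[str] = set()
    for protein in {p for p, _ in best}:
        blast = best.get((protein, "blast"))
        hmm = best.get((protein, "hmm"))
        agree = blast and hmm and blast.family == hmm.family
        if agree and max(blast.evalue, hmm.evalue) <= CONFIDENT:
            assigned[protein] = blast.family
        elif any(h and h.evalue <= BORDERLINE for h in (blast, hmm)):
            to_curate.add(protein)  # weak or conflicting evidence
    return assigned, to_curate


hits = [
    Hit("p1", "GH5", 1e-50, "blast"),
    Hit("p1", "GH5", 1e-40, "hmm"),
    Hit("p2", "GH13", 1e-8, "blast"),
    Hit("p3", "GT2", 0.5, "hmm"),
]
assigned, to_curate = triage(hits)
print(assigned, to_curate)
```

The point of the sketch is that the outcome depends on the libraries, thresholds and curation rules chosen, which is one reason why annotations produced by different resources are difficult to compare.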

There are two ways to get a genome analysed by CAZy: 
if the genome and encoded proteins are deposited as 
finished entries in GenBank (or EMBL or DDBJ) they 
will be analysed by our daily routines. Alternatively, if 
one wishes to perform a CAZy analysis before deposition 
to GenBank (or EMBL or DDBJ), one should approach 
us for collaboration. Metagenomic data are analysed ex- 
clusively in collaboration due to their usual large size. 



DISPLAY OF STRUCTURAL INFORMATION 

The CAZy database is not only used by those who wish to 
analyse genomes, but also by structural biologists who 
study the molecular details of substrate recognition by 
CAZymes. Until September 2013, the only information 
available in the structure pages of CAZy was the accession 
and macromolecule chain name(s) in the PDB (http:// 
www.rcsb.org) (19). We have made a series of develop- 
ments to provide additional information relevant to the 
3D structure of CAZymes such as the resolution (for 
crystal structures) and a description of the carbohydrate 
ligands found in the CAZyme binding sites.

The resolution information is straightforward to 
generate, as it is present in the PDB files of structures 
solved by X-ray crystallography. When the resolution information
is unavailable in the PDB file, the type of experimental
method by which the structure was solved is
given instead (powder diffraction or nuclear magnetic 
resonance). 
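
As an illustration of this step, the sketch below reads the resolution from the text of a legacy-format PDB file and falls back to the experimental method when no numerical resolution is present. It is a minimal example and not the code used by CAZy; it assumes the usual 'REMARK   2 RESOLUTION.' and 'EXPDTA' record layout of the PDB file format, and the function name and return convention are ours.

```python
import re

# Typical record: "REMARK   2 RESOLUTION.    1.74 ANGSTROMS."
_RESOLUTION = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+(\d+(?:\.\d+)?)\s+ANGSTROMS"
)


def resolution_or_method(pdb_text: str):
    """Return ("resolution", value in angstroms) or ("method", EXPDTA text)."""
    method = None
    for line in pdb_text.splitlines():
        match = _RESOLUTION.match(line)
        if match:
            return ("resolution", float(match.group(1)))
        if method is None and line.startswith("EXPDTA"):
            method = line[6:].strip()  # text after the record name
    return ("method", method or "unknown")


# Hand-written header fragments used only to exercise the function.
xray_header = (
    "EXPDTA    X-RAY DIFFRACTION\n"
    "REMARK   2\n"
    "REMARK   2 RESOLUTION.    1.74 ANGSTROMS.\n"
)
nmr_header = (
    "EXPDTA    SOLUTION NMR\n"
    "REMARK   2 RESOLUTION. NOT APPLICABLE.\n"
)
print(resolution_or_method(xray_header))
print(resolution_or_method(nmr_header))
```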

On the other hand, the PDB does not provide any
option to perform a comprehensive search for carbohydrate
structures found in CAZyme binding sites and,
unlike proteins or nucleic acids, the nomenclature for
carbohydrate residues within PDB files is not standardized
(20). In addition, the information on how the isolated
carbohydrate residues are linked to each other is not
described in PDB files. We thus extract the carbohydrate
ligand information from PDB files using PDB-care
(http://www.glycosciences.de/tools/pdb-care/) (21,22). The
carbohydrate molecules covalently linked to an Asn or a Ser/Thr
residue were discarded to eliminate N- and O-glycans
to identify the carbohydrate ligands bound to CAZyme
active sites. The latter are shown in the structure pages of
CAZy following their IUPAC nomenclature.

Table 2. Examples of carbohydrate ligands treated manually

Category          Common name           Display in CAZy structure pages                                        Example of PDB file
Nonreducing       α-cyclodextrin        α-cyclodextrin                                                         3EDF
oligosaccharides  β-cyclodextrin        β-cyclodextrin                                                         3CGT
                  [illegible]           [illegible]                                                            4FFH
                  Raffinose             α-D-Galp-(1-6)-α-D-Glcp-(1-2)-β-D-Fruf                                 1W2T
                  Kestose               α-D-Glcp-(1-2)-β-D-Fruf-(1-2)-β-D-Fruf                                 3LDR
                  Nystose               α-D-Glcp-(1-2)-β-D-Fruf-(1-2)-β-D-Fruf-(1-2)-β-D-Fruf                  3LEM
Thio-             Thio-cellobiose       β-D-Glcp-(1-4)-β-D-Glcp4S                                              4IPM
oligosaccharides  [illegible]           β-D-Glcp-(1-[?])-β-D-Glcp[?]S                                          1J8V
                  [illegible]           β-D-Xylp-(1-4)-β-D-Xylp4S-(1-4)-β-D-Xylp4S-(1-4)-β-D-Xylp4S-(1-4)-     3CUJ
                                        β-D-Xylp4S
                  α-methyl-thio-        β-D-Glcp-(1-4)-β-D-Glcp4S-(1-4)-β-D-Glcp4S-(1-4)-β-D-Glcp4S-(1-4)-     1H5V
                  cellopentaoside       α-D-Glcp4S-(1-1)-methyl
[illegible]       [illegible]           β-D-Glcp[?]F                                                           4AMX
                  [illegible]           [illegible]                                                            1UYQ
                  [illegible]           [illegible]                                                            2XVK
3,6-anhydro       Neoagarohexaose       α-L-3,6-anhydro-Galp-(1-3)-β-D-Galp-(1-4)-α-L-3,6-anhydro-Galp-(1-3)-  2CDO
oligosaccharides                        β-D-Galp-(1-4)-α-L-3,6-anhydro-Galp-(1-3)-β-D-Galp
                  Porphyran/agarose     α-L-Galp6SO3-(1-3)-α-D-Galp-(1-4)-α-L-3,6-anhydro-Galp-(1-3)-          4AW7
                  hexasaccharide        β-D-Galp-(1-4)-α-L-Galp6SO3-(1-3)-α-D-Galp
                  Agarooctaose          α-L-3,6-anhydro-Galp-(1-3)-β-D-Galp-(1-4)-α-L-3,6-anhydro-Galp-(1-3)-  4ATF
                                        β-D-Galp-(1-4)-α-L-3,6-anhydro-Galp-(1-3)-β-D-Galp-(1-4)-
                                        α-L-3,6-anhydro-Galp-(1-3)-β-D-Galp
Acarbose and      Acarbose              <non_carb>-(1-4)-α-D-6-deoxy-Glcp4N-(1-4)-α-D-Glcp-(1-4)-β-D-Glcp      3ZOA
its derivatives   Acarbose-derived      <non_carb>-(1-4)-α-D-6-deoxy-Glcp4N-(1-4)-α-D-Glcp                     1XCW
                  trisaccharide
                  Acarbose-derived      α-D-6-deoxy-Glcp4N-(1-4)-α-D-Glcp-(1-4)-<non_carb>-(0-4)-              1PIG
                  pentasaccharide       α-D-6-deoxy-Glcp4N-(1-4)-α-D-Glcp-(1-4)-β-D-Glcp
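
The removal of N- and O-glycans described above amounts to a simple filter on the residue to which each carbohydrate is covalently attached. The Python below is a sketch under stated assumptions: the dictionary layout and the key names are hypothetical, and in practice the attachment information would come from the PDB-care output or the PDB file itself.

```python
# Residues that carry N-glycans (Asn) or O-glycans (Ser/Thr).
GLYCOSYLATED_RESIDUES = {"ASN", "SER", "THR"}


def binding_site_candidates(ligands: list[dict]) -> list[dict]:
    """Keep carbohydrates that are not covalently linked to Asn, Ser or Thr."""
    return [
        lig for lig in ligands
        if lig["linked_residue"] is None
        or lig["linked_residue"].upper() not in GLYCOSYLATED_RESIDUES
    ]


# Hypothetical records: one free ligand and two protein-linked glycans.
ligands = [
    {"iupac": "b-D-Glcp-(1-4)-b-D-Glcp", "linked_residue": None},
    {"iupac": "b-D-GlcpNAc", "linked_residue": "ASN"},
    {"iupac": "a-D-Manp", "linked_residue": "Thr"},
]
kept = binding_site_candidates(ligands)
print([lig["iupac"] for lig in kept])
```

Only the carbohydrate that is not attached to Asn, Ser or Thr is retained as a candidate binding-site ligand.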

Not all carbohydrate structures are susceptible to auto- 
mated description by PDB-care. In a number of cases, we 
have manually curated and provided IUPAC descriptions
for structures that are unsuitable for PDB-care: (i)
nonreducing glycans (cyclodextrins, sucrose and sucrose de- 
rivatives, trehalose, kestose, raffinose, nystose, etc.), (ii) 
ligands that contain both carbohydrate and noncarbohy- 
drate moieties such as acarbose and acarbose derivatives, 
(iii) sulfur-containing oligosaccharides, (iv) fluorine-con- 
taining carbohydrates and (v) oligosaccharides containing 
3,6-anhydro bridges. Table 2 displays examples of the 
manually handled cases. In addition, automated scripts 
have been devised to handle ~180 carbohydrate analogues
that we denote <carb_like_ligandref> where ligandref cor- 
responds to the three-letter ligand name given by the PDB. 
For instance, the carbohydrate-like inhibitor 
1-deoxynojirimycin appears as <carb_like_NOJ>. The 
structural biology community is invited to contact us to
report any errors that might have slipped through
our curation process.
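
The naming convention for carbohydrate analogues is simple enough to express in a few lines. The function below is an illustrative sketch, not the script used by CAZy; the function name and the validation rule (exactly three alphanumeric characters) are our assumptions based on the 'three-letter ligand name' mentioned above.

```python
import re


def carb_like_label(ligand_code: str) -> str:
    """Build the <carb_like_XXX> label from a three-character PDB ligand code."""
    code = ligand_code.strip().upper()
    if not re.fullmatch(r"[A-Z0-9]{3}", code):
        raise ValueError(f"expected a three-character ligand code, got {ligand_code!r}")
    return f"<carb_like_{code}>"


print(carb_like_label("NOJ"))
```

For the ligand code NOJ this reproduces the label given in the text for 1-deoxynojirimycin.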

As of September 2013, >1400 CAZymes and modules 
thereof have a known 3D structure, corresponding to 
almost 6000 PDB entries out of which ~1500 carbohydrate 
(or carbohydrate analogue) ligands are now identified and 
presented in the structure tab of each CAZy family. 



FUTURE DIRECTIONS 

CAZy is a knowledge-based resource that aims to link the
sequence, the specificity and the 3D structural features
of CAZymes. How these enzymes achieve selective
recognition of target substrates that display only subtle
stereochemical differences is key to prediction of substrate
specificity. While this is already achievable for a few
subfamilies, we are still a long way from a reliable automated
substrate (and/or product) prediction for all
CAZymes encoded by a genome. We believe that subfamily-based
target selection for experimental investigation of
CAZymes will progressively fill the knowledge gap and
thereby allow reliable functional predictions in the future.